Downloading microbiome sequences from SRA

Brief Intro

Microbiome sequencing reads can be obtained from several sources. The most common are:

  1. Reads directly from a sequencing platform.
  2. Reads downloaded from the Sequence Read Archive (SRA) or the European Nucleotide Archive (ENA).
  3. Reads synthesized using sequencing simulators.

Snakemake workflow rules


Tentative snakemake workflow



Setting up SRA Toolkit

Quick glimpse

The NCBI Sequence Read Archive (SRA) stores sequencing data from next-generation sequencing platforms. Users can download data from the SRA using the SRA Toolkit or custom computational methods.

Demo: installing SRA Toolkit on macOS

Download sratoolkit

  • Navigate to where you want to install the tools, preferably the home directory.
curl -LO  https://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/3.0.0/sratoolkit.3.0.0-mac64.tar.gz
tar -xf sratoolkit.3.0.0-mac64.tar.gz
export PATH=$HOME/sratoolkit.3.0.0-mac64/bin/:$PATH
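The export command above only affects the current shell session. As a minimal sketch, the line could be added to a shell startup file so that new terminals also find the tools. The helper name persist_path and the choice of ~/.zshrc (the default startup file for zsh on recent macOS) are assumptions, not part of the original instructions.

```shell
# Hypothetical helper: append a PATH export to a shell startup file, once.
# Arguments: $1 = directory to add to PATH, $2 = startup file (e.g. ~/.zshrc).
persist_path() {
    dir="$1"
    rcfile="$2"
    line="export PATH=\"$dir:\$PATH\""
    touch "$rcfile"
    # -F matches the text literally, -x requires a whole-line match.
    if ! grep -qxF "$line" "$rcfile"; then
        printf '%s\n' "$line" >> "$rcfile"
    fi
}

# Example (assumes the toolkit was unpacked in the home directory):
# persist_path "$HOME/sratoolkit.3.0.0-mac64/bin" "$HOME/.zshrc"
```

Calling the helper twice leaves a single export line in the file, so it is safe to re-run.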

Create a cache root directory

mkdir -p ~/.ncbi
echo '/repository/user/main/public/root = "cache_directory"' > ~/.ncbi/user-settings.mkfg
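In the command above, cache_directory is a placeholder that should be replaced with a real path. A minimal sketch of the same step with the path filled in is shown here; the helper name set_sra_cache is hypothetical, and both the cache location and the settings file are passed as arguments rather than assumed.

```shell
# Hypothetical helper: point the SRA Toolkit cache root at a real directory.
# Arguments: $1 = cache directory, $2 = path of the user-settings.mkfg file.
set_sra_cache() {
    cache_dir="$1"
    settings="$2"
    mkdir -p "$cache_dir" "$(dirname "$settings")"
    # Overwrites the settings file, as the echo command above does.
    printf '/repository/user/main/public/root = "%s"\n' "$cache_dir" > "$settings"
}

# Example (the cache location is an assumption; pick a disk with enough space):
# set_sra_cache "$HOME/sra-cache" path/to/user-settings.mkfg
```

Running vdb-config -i afterwards, as described in the next step, shows whether the toolkit picked up the new location.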

Confirm SRA Toolkit configuration

  • The vdb-config -i command below opens an interactive, blue-colored dialog.
  • Press tab or c to navigate to the cache tab.
  • Review the configuration, then press s to save and x to exit.
vdb-config -i

A screenshot of the SRA configuration.



Using already installed sratools

Alternatively, we can create a conda environment that installs the essential tools from an environment.yml file (refer to IMAP-PART 01).

name: sradb
channels:
  - conda-forge
  - bioconda
dependencies:
  - snakemake =7.19.1
  - snakemake-minimal =7.19.1
  - snakedeploy =0.8.6
  - sra-tools
  - entrez-direct
  - pysradb
  - insilicoseq =1.5.4
  - seqkit =2.3.1

mamba env create --name sradb --file environment.yml
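Once the environment has been created and activated (for example with conda activate sradb), it can be useful to confirm that the main executables are reachable. The sketch below is one way to do that; check_tools is a hypothetical helper, and the executable names are the ones these packages normally install (fasterq-dump from sra-tools, esearch from entrez-direct, iss from insilicoseq).

```shell
# Hypothetical helper: report which of the given commands are missing from PATH.
# Returns 0 when every command is found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found:   $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executables provided by some of the packages listed in environment.yml.
check_tools snakemake fasterq-dump esearch pysradb iss seqkit \
    || echo "Activate the sradb environment or install the missing tools."
```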


Downloading multiple fastq files

Using fasterq-dump

  • Make sure that fasterq-dump is on the PATH.
  • Type which fasterq-dump or fasterq-dump --help to confirm.
  • Specify the output and temporary directories.
  • A range of SRA accessions can be specified in a for loop.

Example code for downloading reads for SRA accessions SRR7450706 through SRR7450761

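# Requires bash (C-style arithmetic for loop).
# --split-3 writes paired reads to separate files, --force overwrites
# existing output, and --skip-technical drops technical reads.
for (( i = 706; i <= 761; i++ ))
    do
        time fasterq-dump "SRR7450$i" \
        --split-3 \
        --force \
        --skip-technical \
        --outdir data/reads \
        --temp data/temp \
        --threads 4
    done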
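The numeric loop above works when the accessions form a contiguous range. When they do not, a list file can drive the download instead. The following is a minimal sketch under stated assumptions: download_runs is a hypothetical helper, the input is assumed to be a text file with one run accession per line (the project tree lists results/run_accessions.txt, but its format is not shown here), and DRY_RUN=1 is an illustrative switch that prints the commands rather than running them.

```shell
# Hypothetical helper: run fasterq-dump for every accession listed in a file.
# Argument: $1 = text file with one run accession per line (assumed format).
# Set DRY_RUN=1 to print the commands instead of running them.
download_runs() {
    while IFS= read -r acc || [ -n "$acc" ]; do
        # Skip blank lines and comment lines.
        case "$acc" in ''|'#'*) continue ;; esac
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "fasterq-dump $acc --split-3 --force --skip-technical --outdir data/reads --temp data/temp --threads 4"
        else
            # Redirect stdin so the tool cannot consume the accession list.
            fasterq-dump "$acc" --split-3 --force --skip-technical \
                --outdir data/reads --temp data/temp --threads 4 < /dev/null
        fi
    done < "$1"
}

# Example: preview the commands for the first three accessions of the range above.
acc_file="$(mktemp "${TMPDIR:-/tmp}/acc.XXXXXX")"
seq 706 708 | sed 's/^/SRR7450/' > "$acc_file"
DRY_RUN=1 download_runs "$acc_file"
```

Previewing with DRY_RUN=1 first makes it easy to spot a malformed accession before any data is transferred.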

Compressing and uncompressing files

Microbiome FASTQ files are usually very large, so compressing them can save considerable disk space.

Example syntax

# Uncompress all gzipped files in data/reads
gunzip data/reads/*.gz

# Compress all FASTQ files in data/reads
gzip data/reads/*.fastq
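Because gzip replaces each original file with its compressed copy, it can be worth verifying every archive right after it is written. A minimal sketch, assuming the reads are uncompressed files ending in .fastq; compress_fastq is a hypothetical helper name.

```shell
# Hypothetical helper: gzip every .fastq file in a directory and verify each archive.
# Argument: $1 = directory containing the FASTQ files.
compress_fastq() {
    for fq in "$1"/*.fastq; do
        # If nothing matches, the pattern stays unexpanded; skip it.
        [ -e "$fq" ] || continue
        gzip "$fq"
        # gzip -t tests the integrity of the compressed file.
        if gzip -t "$fq.gz"; then
            echo "ok: $fq.gz"
        else
            echo "corrupt: $fq.gz" >&2
            return 1
        fi
    done
}

# Example:
# compress_fastq data/reads
```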


How to resize FASTQ files

Purpose

  • Sometimes we want to extract a small subset of reads to test the bioinformatics pipeline.
  • You can resize the FASTQ files using the seqkit sample function [seqkit2022?].
  • Below is a quick demo that extracts only 1% of the paired-end metagenomics sequencing data.

Example

This example extracts 1% of the reads from only two samples (SRR10245277 and SRR10245278).

mkdir -p data
for i in {77..78}
  do
    # Forward reads: sample about 1%, then shuffle
    cat "SRR102452${i}_R1.fastq" \
    | seqkit sample -p 0.01 \
    | seqkit shuffle -o "data/SRR102452${i}_R1_sub.fastq"
    # Reverse reads: same steps in a separate pipeline
    cat "SRR102452${i}_R2.fastq" \
    | seqkit sample -p 0.01 \
    | seqkit shuffle -o "data/SRR102452${i}_R2_sub.fastq"
  done
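After subsampling, a quick sanity check is to confirm that the forward and reverse files still hold the same number of reads. The sketch below uses only awk; count_reads and check_pairs are hypothetical helper names, and the count assumes the standard four-line FASTQ record layout with no wrapped sequence lines.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes four lines per record (header, sequence, plus line, qualities).
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Hypothetical helper: succeed only if both mate files hold the same number of reads.
check_pairs() {
    n1="$(count_reads "$1")"
    n2="$(count_reads "$2")"
    echo "$1: $n1 reads, $2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}

# Example:
# check_pairs data/SRR10245277_R1_sub.fastq data/SRR10245277_R2_sub.fastq
```

Equal counts are a necessary check, not a sufficient one: they do not prove that the read identifiers match between the two files.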





References

[1]
Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4



Appendix

Project main tree

.
├── LICENSE
├── README.md
├── config
│   ├── config.yaml
│   ├── samples.tsv
│   └── units.tsv
├── dags
│   ├── rulegraph.png
│   └── rulegraph.svg
├── data
│   ├── metadata
│   ├── reads
│   ├── temp
│   └── test
├── docs
│   └── env_spec_file.txt
├── images
│   ├── smkreport
│   ├── sra.png
│   └── sra_config_cache.png
├── index.Rmd
├── library
│   ├── apa.csl
│   ├── imap.bib
│   └── references.bib
├── report.html
├── results
│   ├── project_tree.txt
│   └── run_accessions.txt
├── styles.css
└── workflow
    ├── Snakefile
    ├── envs
    ├── rules
    ├── schemas
    └── scripts

17 directories, 19 files

Screenshot of interactive snakemake report

The interactive snakemake HTML report can be viewed by opening the report.html using any compatible browser. You will be able to explore the workflow and the associated statistics. You can close the left bar to get a more expansive display view.

Troubleshooting and FAQs

  1. Question
    • Answer
  2. Question
    • Answer